Use INE mobility data to explore the value of CORINE land cover data when modeling human mobility in cities.
Can random forest regression with land cover variables improve predictions compared to a simple linear gravity model?
Given the large number of land cover variables (52, counting origin and destination variables together) and their uncertain relationship to human mobility, random forests may help incorporate these data into a model without making assumptions about interactions (e.g. between origin and destination variables) or transformations (e.g. log scale). Random forest models also indicate the relative importance of each variable, which may be of interest in its own right.
Which model provides the best predictions for new cities (and does RF struggle with larger cities that have out-of-sample flows)?
Given that INE provides mobility data covering nearly all residents of Spain, a model of mobility is more useful if it can provide accurate predictions for other cities. To test this, I model mobility using data for the ten cities with the largest number of mobility areas (Madrid, Barcelona, Sevilla, Valencia, Zaragoza, Malaga, Las Palmas de Gran Canaria, Cordoba, Bilbao, and Palma de Mallorca). I then leave out one city at a time, refit the models on the remaining cities, and test them on the left-out city. I use Diebold-Mariano tests to compare the predictions for each city, along with the root mean squared error (RMSE) of the predictions.
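The leave-one-city-out procedure described above can be sketched as follows. This is a minimal illustration on synthetic data, using base R's `lm()` as a stand-in for either model. The data frame `flows`, its columns `city`, `pop`, and `dist`, and the helper `loco_rmse()` are hypothetical names, not those used in the analysis.

```r
# Synthetic stand-in data: flows for three hypothetical cities.
set.seed(1)
flows <- data.frame(
  city = rep(c("A", "B", "C"), each = 50),
  pop  = runif(150, 1e3, 5e4),
  dist = runif(150, 0.5, 20)
)
flows$flujo <- exp(0.5 * log(flows$pop) - 0.9 * log(flows$dist) +
                     rnorm(150, sd = 0.3))

# Fit on every city except `test_city`, then score on the held-out city.
loco_rmse <- function(data, test_city) {
  train <- data[data$city != test_city, ]
  test  <- data[data$city == test_city, ]
  fit   <- lm(log(flujo) ~ log(pop) + log(dist), data = train)
  pred  <- exp(predict(fit, newdata = test))  # back to the flow scale
  sqrt(mean((test$flujo - pred)^2))
}

cities  <- unique(flows$city)
results <- sapply(cities, function(ct) loco_rmse(flows, ct))
print(results)
```

The key design point is that the held-out city contributes nothing to the fit, so each RMSE measures how well the model transfers to a city it has not seen.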
When mapped, are there substantive differences between the model predictions?
While the Diebold-Mariano tests might be interesting, the RF models require vastly more time and computational resources than the simple gravity models. As such, any improvement should be substantively meaningful, not just statistically significant, for this method to be useful to policymakers, for example in cities that do not have ground-truth mobility data to rely on.
One way to check for substantive differences is to map the predictions of combined models, alongside the observed mobility in those cities.
INE provides data on flows between mobility areas (roughly barri-sized) for all Wednesdays and Sundays since the beginning of the pandemic. I have limited the data to September and October of 2021, as these were the months in which Covid case counts and restrictions made it most likely that mobility approximated “normality.” For the analysis described here, I have also limited it to Wednesdays. As mentioned before, the data used for modeling includes only mobility between areas within the municipalities of the ten cities.
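The date restriction described above could be expressed along these lines. This is a sketch only: the `dates` vector is a hypothetical stand-in for whatever date field the INE files carry.

```r
# Hypothetical vector of dates covered by the mobility data.
dates <- seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "day")

# Keep only September and October 2021.
in_window <- dates >= as.Date("2021-09-01") & dates <= as.Date("2021-10-31")

# "%u" is the ISO weekday number (Monday = 1), which avoids
# locale-dependent weekday names.
is_wednesday <- format(dates, "%u") == "3"

study_dates <- dates[in_window & is_wednesday]
print(study_dates)
```

Under this filter the study window contains nine Wednesdays, from 1 September to 27 October 2021.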
Linear Gravity Model (LM)
Linear multilevel model of flows between mobility areas with the following independent variables: population of destination and origin, area of destination and origin, and distance between destination and origin, with city as a grouping level (random intercept). All numeric variables, including the flows, are on the log scale.
Modeled using lme4.
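Written out, the model described above takes roughly the following form. The symbols are my own notation rather than the author's: $F_{ij}$ is the flow from origin area $i$ to destination area $j$, $P$ is population, $A$ is area, $d_{ij}$ is distance, and $u_c$ is the random intercept for city $c$.

```latex
\log F_{ij} = \beta_0
  + \beta_1 \log P_j + \beta_2 \log A_j
  + \beta_3 \log P_i + \beta_4 \log A_i
  + \beta_5 \log d_{ij}
  + u_c + \varepsilon_{ij},
\qquad u_c \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Exponentiating both sides gives the familiar multiplicative gravity form, $F_{ij} \propto P_j^{\beta_1} A_j^{\beta_2} P_i^{\beta_3} A_i^{\beta_4} d_{ij}^{\beta_5}$, so a negative $\beta_5$ means that flows decay with distance.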
Random Forest (RF)
Random forest regression with 500 trees including all of the numeric variables in the LM, the 52 land cover variables, and dummy variables for the ten cities.
Modeled using randomForest.
First, summary information about the two models:
rf
##
## Call:
## randomForest(formula = flujo ~ ., data = train.all, importance = T)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 22
##
## Mean of squared residuals: 740.4622
## % Var explained: 95.71
varImpPlot(rf, type = 1)
summary(lm)
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## log(flujo) ~ log(pob_destino) + log(area_destino) + log(pob_residencia) +
## log(area_residencia) + log(dist) + (1 | city_destino)
## Data: data
##
## REML criterion at convergence: 251887
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -4.0675 -0.7084 -0.0054  0.7048  4.8286
##
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  city_destino (Intercept) 0.1712   0.4137
##  Residual                 0.4184   0.6468
## Number of obs: 128020, groups: city_destino, 10
##
## Fixed effects:
##                       Estimate Std. Error t value
## (Intercept)          -3.567521   0.153056  -23.31
## log(pob_destino)      0.186539   0.005465   34.13
## log(area_destino)     0.341906   0.001701  201.01
## log(pob_residencia)   0.564808   0.005646  100.04
## log(area_residencia)  0.173799   0.001775   97.92
## log(dist)            -0.862350   0.003114 -276.94
##
## Correlation of Fixed Effects:
##             (Intr) lg(pb_d) lg(r_d) lg(pb_r) lg(r_r)
## lg(pb_dstn) -0.330
## log(r_dstn) -0.083 -0.179
## lg(pb_rsdn) -0.343  0.013    0.063
## lg(r_rsdnc) -0.081 -0.004    0.221  -0.163
## log(dist)    0.015  0.001   -0.505  -0.064   -0.420
sqrt(mean(data$errors_rf^2))
## [1] 24.1214
sqrt(mean(data$errors_lm^2))
## [1] 98.71768
We can see that the RF model (RMSE: 24.1) outperforms the gravity model (RMSE: 98.7) by a wide margin when modeling the full data. In the variable importance plot, the gravity model variables (population, area, distance) are all among the most important, though the area variables and destination population are outranked by several of the land cover variables. This suggests, unsurprisingly, that human mobility in cities is influenced by the character of neighborhoods, not just their density.
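For reference, the RMSE comparison above amounts to the following calculation. This is a sketch with made-up numbers. It assumes that the LM predictions, which are on the log scale, are exponentiated before being compared with observed flows; the source does not state how `errors_lm` was computed.

```r
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Hypothetical observed flows and predictions from the two models.
observed    <- c(120, 45, 300, 10)
pred_rf     <- c(110, 50, 280, 12)        # RF predicts flows directly
pred_lm_log <- log(c(90, 60, 200, 15))    # LM predicts log(flow)

rmse_rf <- rmse(observed, pred_rf)
rmse_lm <- rmse(observed, exp(pred_lm_log))  # back-transform before comparing
print(c(rf = rmse_rf, lm = rmse_lm))
```

Comparing both models on the flow scale keeps the two RMSE values in the same units.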
Another interesting note about the variable importance plot is that origin variables ("_residencia") appear to be more important than destination ones ("_destino"). The table below summarizes:
## # A tibble: 4 × 2
##   variables                pct_inc_mse
##   <chr>                          <dbl>
## 1 Origin (All)                    35.3
## 2 Destination (All)               21.8
## 3 Origin (Land Cover)             32.5
## 4 Destination (Land Cover)        20.1
The measure “pct_inc_mse” is randomForest's %IncMSE importance: the increase in out-of-bag mean squared error when a given variable's values are randomly permuted, which approximates how much the model relies on that variable. We can see that the origin variables are indeed better predictors, on average, than the destination variables. This suggests that human mobility in these cities is more dependent on “push” factors than “pull” ones.
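The grouped figures in the table above can be produced by averaging importance scores over variables that share a suffix. The sketch below uses invented values and hypothetical variable names such as `clc_111_residencia`, and it assumes the table reports simple means. In practice the named vector could be taken from `importance(rf, type = 1)`.

```r
# Hypothetical %IncMSE values, named like the model's predictors.
imp <- c(
  pob_residencia = 40, area_residencia = 30, clc_111_residencia = 35,
  pob_destino    = 20, area_destino    = 25, clc_111_destino    = 18,
  dist           = 60
)

# Classify variables by suffix; `dist` belongs to neither group.
is_origin <- grepl("_residencia$", names(imp))
is_dest   <- grepl("_destino$", names(imp))

summary_tbl <- data.frame(
  variables   = c("Origin (All)", "Destination (All)"),
  pct_inc_mse = c(mean(imp[is_origin]), mean(imp[is_dest]))
)
print(summary_tbl)
```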
The table below reports the RMSE of predictions for each test city. The given city was treated as the “test” data while the remaining cities served as the “train” data, repeated for each one. The Madrid, Bilbao, and Malaga variables ranked highly in importance above, which indicates that those cities are particularly distinct in their mobility patterns, so I would expect the models to struggle with them. Also, RF cannot extrapolate beyond the range of values seen in training, so I expect the linear model may do better with the larger cities.
## # A tibble: 10 × 4
##    city                       rmse_rf rmse_lm better_predictions
##    <chr>                        <dbl>   <dbl> <chr>
##  1 Barcelona                     86.3    67.3 "LM ***"
##  2 Madrid                       112.     75.1 "LM ***"
##  3 Valencia                     103.    116.  "RF ***"
##  4 Sevilla                      108.    114.  "RF ***"
##  5 Cordoba                      114.    110.  "LM **"
##  6 Zaragoza                     114.    113.  "LM "
##  7 Las Palmas de Gran Canaria   138.    145.  "RF ***"
##  8 Malaga                       254.    276.  "RF ***"
##  9 Bilbao                       281.    306.  "RF ***"
## 10 ALL CITIES                   117.    102.  "LM ***"
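As a rough illustration of the kind of comparison behind the significance stars, the sketch below computes a simplified Diebold-Mariano statistic under squared-error loss. It is not the author's code: it assumes no autocorrelation in the loss differential and uses a normal approximation, and `dm_simple()` and the simulated error vectors are hypothetical. A packaged implementation is available as `forecast::dm.test()`.

```r
# Simplified Diebold-Mariano statistic for squared-error loss.
dm_simple <- function(e1, e2) {
  d    <- e1^2 - e2^2                 # loss differential
  n    <- length(d)
  stat <- mean(d) / sqrt(var(d) / n)  # assumes no autocorrelation in d
  p    <- 2 * pnorm(-abs(stat))       # two-sided, normal approximation
  list(statistic = stat, p.value = p)
}

set.seed(42)
errors_rf <- rnorm(500, sd = 1.0)  # hypothetical RF prediction errors
errors_lm <- rnorm(500, sd = 1.5)  # hypothetical LM errors, larger spread

res <- dm_simple(errors_rf, errors_lm)
print(res)
```

A negative statistic indicates that the first set of errors has the lower average squared loss, and the p-value indicates whether that difference is distinguishable from zero.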
Considering the additional data used and computing time required, the random forest predictions are disappointing. The best-known disadvantage of RF is that it cannot extrapolate beyond the range of values seen in training. Perhaps the cities where LM prevails are those that have more variables with maxima or minima outside the range of the training set.
## # A tibble: 9 × 3
##   city                       oos_vars better_predictions
##   <chr>                         <dbl> <chr>
## 1 Barcelona                        18 "LM ***"
## 2 Madrid                           16 "LM ***"
## 3 Sevilla                           8 "RF ***"
## 4 Zaragoza                          6 "LM "
## 5 Cordoba                           6 "LM **"
## 6 Malaga                            4 "RF ***"
## 7 Valencia                          2 "RF ***"
## 8 Las Palmas de Gran Canaria        2 "RF ***"
## 9 Bilbao                            0 "RF ***"
This helps explain why the RF model struggles when Barcelona and Madrid, at least, are the test cities.
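One plausible way to obtain a count like `oos_vars` is to check, for each predictor, whether the test city's minimum or maximum falls outside the range seen in the training cities. The helper `count_oos_vars()` and the toy data below are hypothetical, and the source does not spell out the exact rule used.

```r
# Count predictors whose range in the test city extends beyond
# the range observed in the training cities.
count_oos_vars <- function(train, test, vars) {
  outside <- vapply(vars, function(v) {
    min(test[[v]]) < min(train[[v]]) || max(test[[v]]) > max(train[[v]])
  }, logical(1))
  sum(outside)
}

train <- data.frame(pop = c(100, 500, 900), area = c(1, 2, 3),
                    dist = c(5, 10, 15))
test  <- data.frame(pop = c(200, 1200), area = c(1.5, 2.5),
                    dist = c(2, 12))

n_oos <- count_oos_vars(train, test, c("pop", "area", "dist"))
print(n_oos)
```

In this toy case `pop` exceeds the training maximum and `dist` falls below the training minimum, so two of the three variables count as out of sample.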
The cities below were selected based on the table in the previous section to illustrate how the random forest regression model performs, compared to the gravity model, in various out-of-sample scenarios.
For Barcelona, the gravity model, which performs better, does a remarkable job considering how little data it uses and how much less computing time it requires than the random forest model. The RF model, on the other hand, roughly captures the pattern of mobility but overshoots the amount of movement.
For Sevilla, the RF model makes significantly better predictions. Visually, the RF model appears to better capture that certain central districts of the city are hubs of mobility. The LM assumes that the two southernmost districts, which are large and close to each other, will have large flows between them. In reality, flows from those districts to the center of the city are heavier, and the RF model does a better job of capturing this.